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ABSTRACT 

Motivation: Ancient DNA (aDNA) molecules in fossilized bones and 
teeth, coprolites, sediments, mummified specimens and museum col- 
lections represent fantastic sources of information for evolutionary 
biologists, revealing the agents of past epidemics and the dynamics 
of past populations. However, the analysis of aDNA generally faces 
two major issues. Firstly, sequences consist of a mixture of endogen- 
ous and various exogenous backgrounds, mostly microbial. Secondly, 
high nucleotide misincorporation rates can be observed as a result of 
severe post-mortem DNA damage. Such misincorporation patterns 
are instrumental to authenticate ancient sequences versus modern 
contaminants. We recently developed the user-friendly mapDamage 
package that identifies such patterns from next-generation sequen- 
cing (NGS) sequence datasets. The absence of formal statistical mod- 
eling of the DNA damage process, however, precluded rigorous 
quantitative comparisons across samples. 

Results: Here, we describe mapDamage 2.0 that extends the original 
features of mapDamage by incorporating a statistical model of DNA 
damage. Assuming that damage events depend only on sequencing 
position and post-mortem deamination, our Bayesian statistical frame- 
work provides estimates of four key features of aDNA molecules: the 
average length of overhangs (A), nick frequency (v) and cytosine de- 
amination rates in both double-stranded regions (<5 d ) and overhangs (S s ). 
Our model enables rescaling base quality scores according to their 
probability of being damaged. mapDamage 2.0 handles NGS datasets 
with ease and is compatible with a wide range of DNA library protocols. 
Availability: mapDamage 2.0 is available at ginolhac.github. 
io/mapDamage/ as a Python package and documentation is main- 
tained at the Centre for GeoGenetics Web site (geogenetics.ku.dk/ 
publications/mapdamage2.0/). 
Contact: jonsson.hakon@gmail.com 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 
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1 INTRODUCTION 

DNA in historical samples is subject to a plethora of environ- 
mental conditions and degradation reactions (Sawyer et al,, 
2012). Abasic sites, strand breaks, interstrand cross-links and a 
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wide diversity of atypic nucleotidic bases are formed following 
oxidative and hydrolytic degradation (Lindahl, 1993; Paabo 
et ah, 2004), even in the most favorable preservation conditions. 

Post-mortem DNA damage limits our ability to access ancient 
DNA (aDNA) sequences and increases the risk of exogenous 
modern contamination, as undamaged DNA molecules are 
more prone to enzymatic manipulation. Nucleotide misincor- 
poration patterns, which are mostly driven by deaminated 
forms of cytosines (uracils), have been suggested as a powerful 
approach to authenticate aDNA sequences generated on next- 
generation sequencing (NGS) platforms (Briggs et al., 2007) and 
motivated the creation of the mapDamage package (Ginolhac 
et al., 2011). Such patterns could vary according to the specific 
molecular approach used for constructing (Meyer et al., 2012) 
and/or amplifying (Ginolhac et al., 2011) second-generation 
DNA libraries. For instance, for one of the most popular proto- 
cols (Meyer and Kircher, 2010), we observe inflated cytosine 
deamination rates at 5'-overhangs, an increase in C — > T substi- 
tution rates toward sequencing starts and complementary in- 
crease in G — > A rates toward reads ends (Briggs et al., 2007). 
Conversely, a novel procedure targeting single-stranded tem- 
plates has shown elevated C — >• T substitution rates at both 
ends (Meyer et al., 2012). 

Statistical modeling of such patterns has been developed by 
Briggs et al., 2007 with strand break, overhangs and cytosine 
deamination as key factors. Using read alignment to reference 
genomes and maximum likelihood optimization, this approach 
has delivered the first quantitative estimates of damage param- 
eters. However, the likelihood framework originally implemented 
scales poorly with the size of NGS datasets, and extensive run- 
ning times have prevented common usage. Here, we present an 
extension of mapDamage, which implements a fast approxima- 
tion of the DNA damage model using a Bayesian framework. 
mapDamage 2.0 opens the possibility of comparing DNA 
damage levels across temporal and environmental gradients. 
Posterior distributions of damage parameters also enable pena- 
lizing the quality score of likely damaged bases, reducing noise in 
downstream single-nucleotide polymorphism (SNP) calling 
procedures. 

2 APPROACH 

Here we build on the DNA damage model described in Briggs 
et al., 2007. We make the simplifying assumption that mutations 
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and post-mortem DNA damage are independent within a frag- 
ment, with occurrences depending only on the relative position 
from the sequence ends. 



We can now correct base quality scores provided in alignment BAM files 
[/>err(', '') at position i for read r] using 

P'&t&r) = 1 - (1 -Pm{i,r)){\ -/>dam(0) 
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3 METHODS 

The general idea is to mutate bases following an Hasegawa, Kishino and 
Yano (HKY) transition matrix (Hasegawa el a!., 1985) and then inde- 
pendently add post-mortem damage on top of mutated bases. In this 
framework, we have multinomial distributions describing the position- 
specific substitutions for any given base (Saj, Scj, Sg,i and St,,)- 

S AJ ~ Mul(A<,(l,0, 0, 0) ■ p) ■ P iam (&,,, S 3 , A, v, 0) 

0 is the HKY transition matrix, and /'dam is defined as the DNA damage 
transition matrix. We assume post-mortem cytosine deamination is the 
main driver of nucleotide misincorporations in agreement with experi- 
mental evidence (Briggs et al., 2007), providing 
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Where the base-specific damage probabilities are defined as 

Pe,{h, S„ K v, i) = V t (XiS s + Sj(l - A.;)) 
p ga (S d , S s , A, v, 0 = (1 - vdfaS, + «</0 - */)) 

The motivation for the base-specific damage probabilities p Aam is best 
explained by the Markov chain in Figure 1 where the first jump decides if 
the position is before or after a nick; then aC->T substitution could be 
observed following deamination in overhang or double-stranded DNA 
regions. A similar Markov chain could be drawn for G -* A substitu- 
tions (Supplementary Section 1). 

For rescaling base quality scores, we assume that C ->• T and G — >• A 
substitutions either originate from true biological differences or from 
damage driven misincorporations. We can derive an estimate for the 
probability that a C -> T (similar for G -* A) misincorporation at pos- 
ition i along the reads is due to damage using 
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Fig. 1. A schematic view describing the DNA damage Markov chain, 
which extends the DNA substitution model. The states C en d and T cll d 
correspond to the final nucleotides in the sequences 



4 DISCUSSION 

We applied mapDamage2.0 on a series of aDNA sequence data- 
sets generated from a range of periods, source materials and 
environments (Supplementary Section 3). Posterior predictive 
intervals and empirical frequencies are in general agreement, as 
shown for the ancient plague dataset (Supplementary Table S2 
and Supplementary Figs S4-S9) (Schuenemann et al., 2011), 
demonstrating the adequacy of our method. We observed a 
ratio of cytosine deamination rates for double- and single- 
stranded regions orders of magnitude greater than estimates 
based on in vitro experiments in aqueous solution (0.007 in 
Lindahl, 1993 versus 0.026-0.070 for Schuenemann et al, 2011 
in Supplementary Table SI). This suggests that tissue- and 
sample-specific micro-environmental characteristics drive differ- 
ent DNA damage kinetics in situ. We also found a significant 
rank correlation between the posterior mean for single-stranded 
cytosine deamination and sample age (Supplementary Table S3) 
in agreement with Sawyer et al, 2012. However, remains of simi- 
lar age and location showed diverse parameter estimates 
(Supplementary Table S2), suggesting a prominent role of 
micro-environmental characteristics over age in diagenesis. 

We also applied our quality rescaling scheme to the sequence 
data of an Australian Aboriginal individual who died in 1920s 
(Rasmussen et al., 2011). This increased the overlap of genotype 
calls to dbSNP vl37, suggesting that lower false-positive SNP 
calls were achieved (Supplementary Table S4). 



5 CONCLUSION 

We have developed a computational method for inferring 
aDNA damage parameters from NGS sequence datasets, with 
minimal changes to the DNA damage model presented by Briggs 
et ah, 2007. Our model is compatible with the specificities of 
different sequencing and library building protocols. We believe 
that downscaling quality scores of likely damaged bases is the 
first from a long list of possible applications for damage param- 
eter posterior distributions, limiting the impact of nucleotide 
misincorporations in downstream sequence analyses. The know- 
ledge of such distributions could also be instrumental for im- 
proving mapping procedures to reference genomes (Schubert 
et al, 2012). 
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